Sampling Methods in Approximate Query Answering Systems

Author

  • Gautam Das
Abstract

INTRODUCTION

In recent years, advances in data collection and management technologies have led to a proliferation of very large databases. These large data repositories are typically created in the hope that through analysis, such as data mining and decision support, they will yield new insights into the data and the real-world processes that created it. In practice, however, while the collection and storage of massive data sets have become relatively straightforward, effective data analysis has proven more difficult to achieve. One reason that data analysis successes have proven elusive is that most analysis queries, by their nature, require aggregation or summarization of large portions of the data being analyzed. For multi-gigabyte data repositories, this means that processing even a single analysis query involves accessing enormous amounts of data, leading to prohibitively expensive running times. This severely limits the feasibility of many types of analysis applications, especially those that depend on timeliness or interactivity.

While keeping query response times short is very important in many data mining and decision support applications, exactness in query results is frequently less important. In many cases, "ballpark estimates" are adequate to provide the desired insights about the data, at least in preliminary phases of analysis. For example, knowing the marginal data distributions for each attribute up to 10% error will often be enough to identify top-selling products in a sales database or to determine the best attribute to use at the root of a decision tree. As a concrete case, consider the following SQL query:

    SELECT State, COUNT(*) AS ItemCount
    FROM SalesData
    WHERE ProductName = 'Lawn Mower'
    GROUP BY State
    ORDER BY ItemCount DESC

This query seeks to compute the total number of a particular item sold in a sales database, grouped by state. Instead of a time-consuming process that produces completely accurate answers, in some circumstances it may be suitable to produce ballpark estimates, e.g. counts to the nearest thousand.

The acceptability of inexact query answers, coupled with the necessity for fast query response times, has led researchers to investigate approximate query answering (AQA) techniques that sacrifice accuracy to improve running time, typically through some sort of lossy data compression. The general rubric in which most approximate query processing systems operate is as follows: first, during the "pre-processing phase", some auxiliary data structures, or data synopses, are built over the database; then, during the "runtime phase", queries …
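The two phases described above can be illustrated with a minimal sketch. It assumes the simplest possible synopsis, a uniform (Bernoulli) row sample, which is only one of many sampling schemes and is not necessarily the method the article goes on to present. The column names `State` and `ProductName` come from the query above; the function names, the `rate` parameter, and the in-memory list of dictionaries standing in for `SalesData` are assumptions made for illustration.

```python
import random
from collections import Counter


def build_sample(rows, rate, seed=0):
    """Pre-processing phase (assumed synopsis): keep each row
    independently with probability `rate`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]


def approx_count_by_state(sample, rate, product_name):
    """Runtime phase: answer the GROUP BY / COUNT query on the sample
    and scale each count by 1/rate to estimate the full-table count."""
    counts = Counter(
        row["State"] for row in sample if row["ProductName"] == product_name
    )
    estimates = {state: count / rate for state, count in counts.items()}
    # Mirror ORDER BY ItemCount DESC from the original query.
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
```

Scaling by `1/rate` gives an unbiased estimate of each group's count, but groups with few matching rows may be missed entirely or estimated poorly, which is one reason more refined sampling schemes are studied.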


Similar resources

Sampling Reloaded

Sampling methods are integral to the design of surveys and experiments, to the validity of results, and thus to the study of statistics, social science, and a variety of other disciplines that use statistical data. Many probabilistic sampling methods have been proposed in the literature for capturing large amounts of data succinctly. Such methods are bound by space and time constraints and find d...


ICICLES: Self-Tuning Samples for Approximate Query Answering

Approximate query answering systems provide very fast alternatives to OLAP systems when applications are tolerant to small errors in query answers. Current sampling-based approaches to approximately answer aggregate queries over foreign key joins suffer from the following drawback. All tuples in relations are deemed equally important for answering queries even though, in reality, OLAP queries e...


Histogram-Based Approximation of Set-Valued Query-Answers

Answering queries approximately has recently been proposed as a way to reduce query response times in on-line decision support systems, when the precise answer is not necessary or early feedback is helpful. Most of the work in this area uses sampling-based techniques and handles aggregate queries, ignoring queries that return relations as answers. In this paper, we extend the scope of approxima...


Cooperative Query Answering for Approximate Answers with Nearness Measure in Hierarchical Structure Information Systems

Thanit Puthpongsiriporn, Ph.D., University of Pittsburgh. Cooperative query answering for approximate answers has been utilized in various problem domains. Many challenges in manufacturing information retrieval, such as: classifying parts into families in group technology implem...


Data Sketch/Synopsis

1. Aggarwal C.C. On biased reservoir sampling in the presence of stream evolution. In Proc. 32nd Int. Conf. on Very Large Data Bases, 2006. 2. Chaudhuri S. et al. Overcoming limitations of sampling for aggregation queries. In Proc. 17th Int. Conf. on Data Engineering, 2001. 3. Ganti V., Lee M.-L., and Ramakrishnan R. ICICLES: Self-tuning samples for approximate query answering. In Proc. 28th In...


Sampling-Based Cardinality Estimation Algorithms: A Survey and An Empirical Evaluation

Cardinality estimation is a fundamental problem that has been studied for several decades in the database community. It has wide applications in many database management issues such as query optimization, query monitoring, query progress indicators, query execution time prediction, and approximate query answering. Existing cardinality estimation techniques generally fall into two categories: i) ...




Publication date: 2009